Project Topic

This is a notebook for Histopathologic Cancer Detection. The task of this project is to detect cancer in photos taken by microscope. Each image is 96x96 pixels, and an image is labeled positive when at least one cancer cell appears in the center 32x32 region. Cancer cells outside this zone do not affect the label, but accuracy improves considerably when the whole 96x96 image is used.
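The center-32x32 labeling rule can be sketched as a simple crop helper (a minimal illustration, not code from this notebook):

```python
import numpy as np

def center_crop(image: np.ndarray, size: int = 32) -> np.ndarray:
    """Extract the central size x size patch from an H x W (x C) image."""
    h, w = image.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return image[top:top + size, left:left + size]

# A 96x96 RGB image: only this central 32x32 patch determines the label.
img = np.zeros((96, 96, 3), dtype=np.uint8)
patch = center_crop(img)
print(patch.shape)  # (32, 32, 3)
```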

target

The target of this project is to achieve AUC > 0.9 on the Kaggle private leaderboard with a CNN. Because of bias in the data, it is possible to reach AUC 0.8 from the photo's color alone (refer to section 2.5, prediction by mean RGB value). Therefore, the CNN must achieve more than AUC 0.9.

1. Data

There are 220,025 training images, and reading all of them takes about 30 minutes. To reduce preprocessing time, image files (.npy) were created in another notebook; they can be loaded directly as NumPy arrays. These files contain data re-sampled to balance positive and negative labels.
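The speed-up comes from `np.save`/`np.load` skipping per-image decoding. A minimal round-trip sketch (the file name `x_train.npy` is an assumption for illustration, not from this notebook):

```python
import os
import tempfile

import numpy as np

# Stand-in for a batch of 96x96 RGB train images stored as uint8.
images = np.zeros((10, 96, 96, 3), dtype=np.uint8)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "x_train.npy")   # hypothetical file name
    np.save(path, images)                   # done once in the other notebook
    loaded = np.load(path)                  # fast compared with decoding image files
    print(loaded.shape, loaded.dtype)       # (10, 96, 96, 3) uint8
```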

data source

The original data is available in kaggle competition website.

https://www.kaggle.com/competitions/histopathologic-cancer-detection/data

numpy file is available here:

https://www.kaggle.com/datasets/hidetaketakahashi/cancerdetection-npy

Due to memory restrictions, the training data is limited to 50,000 images for this report.

2. EDA

2.1 Labels

In the original training data, 40% of the labels are positive (cancer). To make training efficient, the training data is re-sampled to be 50% positive.
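One way to do this re-sampling is to down-sample the majority class until the classes match (an illustrative sketch with toy labels, not the notebook's exact code):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: roughly 40% positive, as in the original train data.
labels = (rng.random(10_000) < 0.4).astype(int)

pos_idx = np.flatnonzero(labels == 1)
neg_idx = np.flatnonzero(labels == 0)
n = min(len(pos_idx), len(neg_idx))

# Keep n samples of each class, then shuffle the combined index set.
balanced = np.concatenate([
    rng.choice(pos_idx, n, replace=False),
    rng.choice(neg_idx, n, replace=False),
])
rng.shuffle(balanced)
print(labels[balanced].mean())  # 0.5: the re-sampled set is balanced
```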

2.2 Train and Test Distribution in RGB

When test data is available, it is good practice to compare the distributions of the train and test data. As the per-channel RGB histograms show, train and test have similar distributions.
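The per-channel histograms used for this comparison can be computed directly from the uint8 arrays (a sketch with random stand-in data):

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.integers(0, 256, size=(100, 96, 96, 3), dtype=np.uint8)
test = rng.integers(0, 256, size=(100, 96, 96, 3), dtype=np.uint8)

def channel_hists(images: np.ndarray) -> np.ndarray:
    """One 256-bin pixel-value histogram per RGB channel, shape (3, 256)."""
    return np.stack([
        np.bincount(images[..., c].ravel(), minlength=256)
        for c in range(3)
    ])

h_train = channel_hists(train)
h_test = channel_hists(test)
# Plotting h_train and h_test side by side gives the comparison in the text.
```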

2.3 Distribution of cancer and normal in RGB

2.4 Plot of photo

Since the RGB distributions of the cancer and normal groups differ, there should be a visible difference in the photos. Indeed, there are many pink-colored photos in the cancer group; this color bias can serve as information for prediction.

Normal (non-cancer)

In the normal (non-cancer) group, most photos are purple-colored.

Cancers

There are more pink-colored photos in the cancer group than in the normal group.

2.5 Prediction from mean RGB value

Based on the EDA above, it seems possible to predict labels from image color alone. In this section, the mean RGB value of each photo is computed, and this feature is used for a RandomForest classifier.
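A minimal sketch of this baseline, using synthetic pink-vs-purple stand-in images (the class means and noise level here are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)

# Synthetic data: positive images brighter on average than negative ones,
# mimicking the pink-vs-purple color bias seen in the EDA.
n = 200
labels = rng.integers(0, 2, n)
base = np.where(labels[:, None, None, None] == 1, 200, 120)
images = (base + rng.normal(0, 20, (n, 96, 96, 3))).clip(0, 255).astype(np.uint8)

# The only feature used in this section: mean R, G, B per image.
features = images.reshape(n, -1, 3).mean(axis=1)  # shape (n, 3)

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(features[:150], labels[:150])
print(clf.score(features[150:], labels[150:]))
```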

As a result, the validation AUC is 0.844.

This prediction scored 0.8151 on the Kaggle private leaderboard. This means the CNN must be considerably more accurate to be worth the GPU usage and longer training time.

2.6 Conclusion of EDA

Train and test data have similar distributions

Since train and test images have similar distributions, validation data can be randomly sampled from the original training data.

There is bias in the data

Cancer and normal images have different RGB distributions. It is possible to predict cancer at AUC 0.8 from the mean color of an image alone. Thus, the CNN must achieve a much higher score.

Data cleaning plan

The data type is uint8, which is compact but still occupies a large amount of RAM. Preprocessing would convert it to float, which is too large to hold in memory. So, in this project, the preprocessing steps (Rescaling and RandomFlip) are built into the CNN model as layers.
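The memory argument is simple arithmetic: converting uint8 pixels to float32 quadruples the footprint. For 50,000 images of 96x96x3:

```python
import numpy as np

# Per-image footprint of a 96x96x3 array in each dtype.
u8 = np.zeros((96, 96, 3), dtype=np.uint8)
f32 = np.zeros((96, 96, 3), dtype=np.float32)

n = 50_000
print(u8.nbytes * n / 1e9)   # ~1.38 GB for the whole set as uint8
print(f32.nbytes * n / 1e9)  # ~5.53 GB as float32, 4x larger
```

Keeping the arrays as uint8 and letting a Rescaling layer do the float conversion per batch avoids holding the float copy of the full dataset in RAM.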

3. CNN Tuning and Model Selection

In this chapter, three CNN models are compared. These models have 5 layers of Conv & MaxPool units.

models

The following elements are common to all three models. These settings were determined through many training trials; the reasons are briefly described below.

Common elements

The models contain Rescaling and RandomFlip layers for image preprocessing.

Among the filter sizes I tried, size 3 was the most stable and performed better (faster learning and good accuracy) than sizes 5 and 7. A MaxPooling layer should follow one or two Conv layers. I also tried Conv with stride = 2 instead of MaxPooling, but MaxPooling was more stable.

Batch size and learning rate are closely related. I tried batch sizes 32, 64, and 128; 128 with the default learning rate was the best. Furthermore, 50% dropout is widely used on Kaggle, and I agree: when the dropout ratio is small (< 0.4), the effect is insufficient. For these models with 50,000 training images, the best epoch seems to be around 8, but I set epochs = 12 to clearly show the symptoms of overfitting.
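With five pooling stages, the 96x96 input shrinks by half per stage. A quick check of the spatial sizes (assuming 3x3 convolutions with 'same' padding, so only the 2x2 max-pools change the size; the padding choice is my assumption):

```python
# Spatial size through 5 (Conv 3x3 'same', MaxPool 2x2) stages, starting at 96.
size = 96
for stage in range(5):
    # 'same'-padded 3x3 conv keeps the size; 2x2 max-pool halves it.
    size //= 2
    print(f"after stage {stage + 1}: {size}x{size}")
# 96 -> 48 -> 24 -> 12 -> 6 -> 3
```

The 3x3 feature map after the fifth stage is then flattened for the fully connected layers.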

5 x (Conv, MaxPool)

The model started overfitting after epoch 5: training loss kept decreasing while validation loss increased.

validation auc

Two predictions are compared. The first, from the original images, achieved AUC 0.959. The second, an ensemble of predictions from flipped images, achieved AUC 0.975. The ensemble improves the result because the CNN is sensitive to the image's orientation.

5 x (Conv Conv MaxPool)

Compared with the previous model, training accuracy is slightly lower but validation accuracy improved. Overfitting is mitigated by the additional Conv layers.

validation auc

Predictions from the single image and from flipped images are compared.

5 x (Conv Conv MaxPool) without Dropout

Next, the Dropout layer is removed from the previous architecture; everything else is unchanged.

Training accuracy kept increasing while validation accuracy stayed around 0.9, which is a form of overfitting.

Results

Validation accuracy and loss of the three models are plotted in the figure below. Model 1 converged fastest, but after 4 epochs model 2 exceeded it. The best epoch for model 2 was 9 (at x = 8 in the figure), where it had the highest accuracy of all. Thus, model 2 is the best architecture. A similar model in another notebook achieved AUC 0.93 on the Kaggle private leaderboard; LeakyReLU activation is used there to avoid the dead-neuron problem that occurred in the model with ReLU.

4. Analysis: Where is cancer?

Even though a CNN is something of a black box, we can check which part of an image makes its prediction confident. This is done by selecting samples with a high predicted probability of being positive, cropping out part of the image, and checking the prediction again. When the probability drops significantly, the removed area may contain cancer cells.
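This occlusion procedure can be sketched as follows; `predict` here is a toy stand-in for the trained CNN, wired so that the "cancer" signal sits in the bottom-left zone:

```python
import numpy as np

def occlusion_map(image, predict, patch=32):
    """Zero out one patch at a time; record the probability drop per zone."""
    base = predict(image)
    h, w = image.shape[:2]
    drops = {}
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = 0
            drops[(top, left)] = base - predict(occluded)  # big drop => important zone
    return drops

# Toy predictor: probability proportional to brightness of the bottom-left zone.
def predict(img):
    return img[64:96, 0:32].mean() / 255.0

img = np.zeros((96, 96, 3))
img[64:96, 0:32] = 255  # the "suspicious" region
drops = occlusion_map(img, predict)
print(max(drops, key=drops.get))  # (64, 0): occluding bottom-left hurts most
```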

In the following case, the probability dropped when the bottom-left zone was cropped out.

In the next case, the probability dropped when the upper-middle zone was cropped out.

5. Conclusion

Model 2, trained with 100,000 images, achieved the target AUC on the Kaggle private leaderboard. It consists of 5 layers of Conv Conv MaxPooling units with 50% Dropout at the FC layer. Here are some findings from this project.

Convolution layers mitigated overfitting

Model 1 (Conv & MaxPool) converged faster but showed obvious overfitting, while model 2 (Conv & Conv & MaxPool) mitigated it. Validation accuracy was also improved by the additional Conv layers.

Dropout layer improved validation accuracy

Model 2 (with dropout) has better accuracy than model 3 (without dropout), which demonstrates the effect of dropout. Dropout is a well-known technique for mitigating overfitting, but according to this result it also improves accuracy.

Leaky relu solved dead neuron

It is not described in this notebook, but by comparing other notebooks (LeakyReLU vs. ReLU), LeakyReLU avoided the dead neurons caused by ReLU when the model has many layers. When many neurons are dead, prediction accuracy does not rise above 0.5.

Flipping image and ensemble predictions is effective

To get a higher AUC, each test image is flipped horizontally and vertically. One image thus yields four predictions, which are ensembled: yhat = mean(yhat1 (original), yhat2 (vertical flip), yhat3 (horizontal flip), yhat4 (horizontal and vertical flip)). This improved the AUC almost every time.
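The four-view ensemble above can be sketched with NumPy slicing; `predict` is a deliberately orientation-sensitive toy stand-in for the trained model:

```python
import numpy as np

def flip_ensemble(image, predict):
    """Average predictions over the original image and its three flips."""
    views = [
        image,
        image[::-1, :],     # vertical flip
        image[:, ::-1],     # horizontal flip
        image[::-1, ::-1],  # both flips
    ]
    return np.mean([predict(v) for v in views])

# Toy predictor that only looks at the top half, so flips change its output.
predict = lambda img: img[:48].mean() / 255.0
img = np.zeros((96, 96, 3))
img[:48] = 255
print(flip_ensemble(img, predict))  # 0.5: averaging cancels the orientation bias
```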

CNN can be visualized somehow

It is difficult for non-professionals to figure out which cells are cancerous. However, some CNN techniques can assist diagnosis: by cropping out part of an image and checking the predicted probability of cancer, we can infer what cancer cells look like.